Tags: production engineering*

Production Engineering focuses on the design, implementation, and management of systems and processes to ensure the efficient and reliable delivery of software and services in a production environment. It involves various aspects such as deploying, monitoring, and maintaining applications, managing infrastructure, and handling data pipelines. Production Engineering KPIs include Availability and Cost.

0 bookmark(s) - Sort by: Date ↓ / Title /

  1. This article details how Nubank built its own in-house logging platform to address issues of cost, scalability, and control over their logging infrastructure. Initially reliant on a vendor solution, they found costs rising unpredictably and experienced limitations in observability and data retention.

    To solve this, Nubank divided the project into two major steps: **The Observability Stream** (ingestion and processing) and the **Query & Log Platform** (storage and querying).

    * **Observability Stream:** Fluent Bit for data collection, a Data Buffer Service for micro-batching, and an in-house Filter & Process Service.
    * **Query & Log Platform:** Trino as the query engine, AWS S3 for storage, and Parquet for data format.

    The new platform currently ingests 1 trillion logs daily, stores 45 PB of searchable data with a 45-day retention, and handles almost 15,000 queries daily. Nubank reports the platform costs 50% less than comparable market solutions while providing them with greater control, scalability, and the ability to customize features. The project underscored Nubank's value of challenging the status quo and leveraging a combination of open-source and in-house development.
  2. An effort to create a fully functional Kubernetes cluster with 1 million active nodes. The article details the challenges and solutions for scaling Kubernetes to this size, covering networking, state management (etcd), and the scheduler.
  3. This paper provides a theoretical analysis of Transformers' limitations for time series forecasting through the lens of In-Context Learning (ICL) theory, demonstrating that even powerful Transformers often fail to outperform simpler models like linear models. The study focuses on Linear Self-Attention (LSA) models and shows that they cannot achieve lower expected MSE than classical linear models for in-context forecasting, and that predictions collapse to the mean exponentially under Chain-of-Thought inference.
  4. This article explores how prompt engineering can be used to improve time-series analysis with Large Language Models (LLMs), covering core strategies, preprocessing, anomaly detection, and feature engineering. It provides practical prompts and examples for various tasks.
  5. Dozzle is a lightweight, self-hosted solution that provides a real-time look into your container logs, offering an intuitive UI, real-time logging, intelligent search, and support for multiple use cases like home labs and local development.
  6. TraceRoot.AI is an AI-native observability platform that helps developers fix production bugs faster by analyzing structured logs and traces. It offers SDK integration, AI agents for root cause analysis, and a platform for comprehensive visualizations.
  7. TraceRoot accelerates the debugging process with AI-powered insights. It integrates seamlessly into your development workflow, providing real-time trace and log analysis, code context understanding, and intelligent assistance. It offers both a cloud and self-hosted version, with SDKs available for Python and JavaScript/TypeScript.
  8. The article discusses the emergence of 'agentic traffic' – outbound API calls made by autonomous AI agents – and the need for a new infrastructure layer, an 'AI Gateway', to govern and secure this traffic. It outlines the components of an AI Gateway and the importance of security, compliance, and observability in managing agentic AI.
  9. The company's transition from fragmented observability tools to a unified system using OpenTelemetry and OneUptime dramatically improved incident response times, reducing MTTR from 41 to 9 minutes. By correlating logs, metrics, and traces through structured logging and intelligent sampling, they eliminated much of the noise and confusion that previously slowed root cause analysis. The shift also reduced the number of dashboards engineers needed to check per incident and significantly lowered the percentage of incidents with unknown causes.

    Key practices included instrumenting once with OpenTelemetry, enforcing cardinality limits, and archiving raw data for future analysis. The move away from 100% trace capture and over-instrumentation helped manage data volume while maintaining visibility into anomalies. This transformation emphasized that effective observability isn't about collecting more data, but about designing correlated signals that support intentional diagnosis and reduce cognitive load.

Top of the page

First / Previous / Next / Last / Page 2 of 0 SemanticScuttle - klotz.me: tagged with "production engineering"

About - Propulsed by SemanticScuttle